Fast Duplicated Documents Detection using Multi-level Prefix-filter

Authors

  • Kenji Tateishi
  • Dai Kusui
Abstract

Duplicate document detection is the problem of rapidly finding all document pairs whose similarity is equal to or greater than a given threshold. A recently proposed method called prefix-filter identifies document pairs whose similarity can never reach the threshold, based on the number of uncommon terms (words/characters) in the pair, and removes them before similarity calculation. However, prefix-filter cannot sufficiently decrease the number of similarity calculations, because it leaves many document pairs whose similarity is less than the threshold. In this paper, we propose multi-level prefix-filter, which applies multiple different prefix-filters to reduce the number of similarity calculations more efficiently while maintaining the advantages of prefix-filter (no detection loss, no extra parameter).
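The filtering idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state which similarity measure is used or how the multiple prefix-filters differ, so the sketch rests on assumptions: similarity is Jaccard similarity over word sets, each level is a standard prefix filter, and the levels differ only in the term ordering used to pick the prefix. The names `prefix`, `jaccard`, and `find_duplicates` are hypothetical, and the brute-force loop over pairs stands in for the inverted index a practical implementation would likely use.

```python
import math
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two non-empty term sets."""
    return len(a & b) / len(a | b)


def prefix(terms, threshold, order):
    """Return the prefix of `terms` under the term ordering `order`.

    If two sets have Jaccard similarity >= threshold, their prefixes
    (taken under the same ordering) must share at least one term.
    """
    ordered = order(terms)
    n = len(ordered)
    # Small epsilon guards against floating-point overshoot in ceil().
    min_overlap = math.ceil(threshold * n - 1e-9)
    return set(ordered[: n - min_overlap + 1])


def find_duplicates(docs, threshold, orders):
    """Return (duplicate pairs, number of similarity calculations).

    `orders` is a list of term-ordering functions, one per filter level
    (an assumption of this sketch). A pair is compared only if its
    prefixes overlap at every level.
    """
    sets = [set(d.split()) for d in docs]
    levels = [[prefix(s, threshold, o) for s in sets] for o in orders]
    pairs, calculations = [], 0
    for i, j in combinations(range(len(sets)), 2):
        if all(level[i] & level[j] for level in levels):
            calculations += 1
            if jaccard(sets[i], sets[j]) >= threshold:
                pairs.append((i, j))
    return pairs, calculations


def ascending(terms):
    return sorted(terms)


def descending(terms):
    return sorted(terms, reverse=True)


docs = ["a b c d e", "a b c d f", "a x y z w"]
print(find_duplicates(docs, 0.6, [ascending]))              # one level
print(find_duplicates(docs, 0.6, [ascending, descending]))  # two levels
```

Because each level on its own never discards a pair that reaches the threshold, requiring a pair to pass every level also loses no true duplicates; extra levels can only reduce the number of similarity calculations. The only parameter remains the similarity threshold.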

Similar resources

Multi-View Face Detection in Open Environments using Gabor Features and Neural Networks

Multi-view face detection in open environments is a challenging task, due to the wide variations in illumination, face appearances and occlusion. In this paper, a robust method for multi-view face detection in open environments, using a combination of Gabor features and neural networks, is presented. Firstly, the effect of changing the Gabor filter parameters (orientation, frequency, standard d...

Applying mean shift and motion detection approaches to hand tracking in sign language

Hand gesture recognition is very important for communicating in sign language. In this paper, an effective object tracking and hand gesture recognition method is proposed. This method is a combination of two well-known approaches, the mean shift algorithm and the motion detection algorithm. The mean shift algorithm can track objects based on color, but when the hand passes the face, occlusion happens. Several...

Image De-Noising and Micro Crack Detection of Solar Cells

The solar cell is known as a sustainable and environmentally friendly source of energy. It converts sunlight directly into electricity with zero emissions and without side effects on the environment. However, solar cells have optical and mechanical defects, which include the type of micro crack, the size of the crack, and the noise from electrical or electromechanical interference during the image...

Time-Varying Frequency Fading Channel Tracking In OFDM-PLNC System, Using Kalman Filter

Physical-layer network coding (PLNC) has the ability to drastically improve the throughput of multi-source wireless communication systems. In this paper, we focus on the problem of channel tracking in a Decode-and-Forward (DF) OFDM PLNC system. We propose a Kalman filter-based algorithm for tracking the frequency/time fading channel in this system. Tracking of the channel is performed in the t...

A New Fast and Accurate Fault Location and Classification Method on MTDC Microgrids Using Current Injection Technique, Traveling-Waves, Online Wavelet, and Mathematical Morphology

In this paper, a new fast and accurate method for fault detection, location, and classification on multi-terminal DC (MTDC) distribution networks connected to renewable energy sources and energy storage is presented. MTDC networks are developing due to factors such as the expansion of DC resources and loads and efforts to increase power quality. It is important to recognize the fault type and location in order ...

Publication date: 2008